Licensed under the Creative Commons attribution-noncommercial license, http://creativecommons.org/licenses/by-nc/3.0/. Please share and remix noncommercially, mentioning its origin.
CC-BY_NC

These notes draw on material from http://kingaa.github.io/sbied/stochsim/stochsim.html



Produced with R version 3.2.3 and pomp version 1.3.1.2.


Objectives

  1. To explain the simplest particle filter, allowing Monte Carlo solution of the POMP filtering and prediction recursions and therefore providing computation of the likelihood.

  2. To provide examples of visualizing and exploring likelihood surfaces using the particle filter for computation of the likelihood.

  3. To consider how to apply the fundamental tools of likelihood-based inference in situations where the likelihood cannot be written down explicitly but can be evaluated and maximized via Monte Carlo methods.

  4. To gain some experience at carrying out likelihood-based inferences for dynamic models using simulation-based statistical methodology in the R package pomp.

  5. Understand how iterated filtering algorithms carry out repeated particle filtering operations, with randomly perturbed parameter values, in order to maximize the likelihood.

  6. Gain experience carrying out statistical investigations using iterated filtering in a relatively simple situation (fitting an SIR model to a boarding school flu outbreak).

13.1 Theory of the particle filter

13.1.1 Indirect specification of the statistical model via a simulation procedure

  • For simple statistical models, we may describe the model by explicitly writing the density function \(f_{Y_{1:N}}(y_{1:N}{\, ; \,}\theta)\). One may then ask how to simulate a random variable \(Y_{1:N}\sim f_{Y_{1:N}}(y_{1:N}{\, ; \,}\theta)\).

  • For many dynamic models it is convenient to define the model via a procedure to simulate the random variable \(Y_{1:N}\). This implicitly defines the corresponding density \(f_{Y_{1:N}}(y_{1:N}{\, ; \,}\theta)\). For a complicated simulation procedure, it may be difficult or impossible to write down \(f_{Y_{1:N}}(y_{1:N}{\, ; \,}\theta)\) exactly.

  • It is important for us to bear in mind that the likelihood function exists even when we don’t know what it is! We can still talk about the likelihood function, and develop numerical methods that take advantage of its statistical properties.




13.1.2 Special case: a deterministic unobserved state process

  • Lets’ begin with a special case where the unobserved state process is deterministic. That is, \(X_{n}=x_n(\theta)\) is a known function of \(\theta\) for each \(n\). What is the likelihood?

  • Since the distribution of the observable random variable, \(Y_n\), depends only on \(X_n\) and \(\theta\), and since, in particular \(Y_{m}\) and \(Y_{n}\) are independent given \(X_{m}\) and \(X_{n}\), we have \[{\mathscr{L}}(\theta) = \prod_{n} f_{Y_n|X_n}(y_n^*{{\, | \,}}x_n(\theta){\, ; \,}\theta)\] or \[{\ell}(\theta) = \log{\mathscr{L}}(\theta) = \sum_{n} \log f_{Y_n|X_n}(y_n^*{{\, | \,}}x_n(\theta){\, ; \,}\theta).\]




13.2 Likelihood for stochastic models by direct simulation

bsflu <- read.table("bsflu_data.txt")

sir_step <- Csnippet("
  double dN_SI = rbinom(S,1-exp(-Beta*I/N*dt));
  double dN_IR = rbinom(I,1-exp(-gamma*dt));
  S -= dN_SI;
  I += dN_SI - dN_IR;
  R += dN_IR;
  H += dN_IR;
")

sir_init <- Csnippet("
  S = nearbyint(N)-1;
  I = 1;
  R = 0;
  H = 0;
")

dmeas <- Csnippet("lik = dbinom(B,H,rho,give_log);")
rmeas <- Csnippet("B = rbinom(H,rho);")

pomp(bsflu,times="day",t0=0,
     rprocess=euler.sim(sir_step,delta.t=1/5),
     initializer=sir_init,rmeasure=rmeas,dmeasure=dmeas,
     zeronames="H",statenames=c("H","S","I","R"),
     paramnames=c("Beta","gamma","rho","N")) -> sir
simulate(sir,params=c(Beta=2,gamma=1,rho=0.5,N=2600),
         nsim=10000,states=TRUE) -> x
matplot(time(sir),t(x["H",1:50,]),type='l',lty=1,
        xlab="time",ylab="H",bty='l',col='blue')
lines(time(sir),obs(sir,"B"),lwd=2,col='black')

ell <- dmeasure(sir,y=obs(sir),x=x,times=time(sir),log=TRUE,
                params=c(Beta=2,gamma=1,rho=0.5,N=2600))
dim(ell)
## [1] 10000    14
ell <- apply(ell,1,sum); summary(exp(ell)); logmeanexp(ell,se=TRUE)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.00e+00  0.00e+00  0.00e+00 5.37e-106  0.00e+00 5.37e-102
##                    se 
## -242.39328   16.09139

13.3 The particle filter

  1. Suppose \(X_{n-1,j}^{F}\), \(j=1,\dots,J\) is a set of \(J\) points drawn from the filtering distribution at time \(n-1\).

  2. We obtain a sample \(X_{n,j}^{P}\) of points drawn from the prediction distribution at time \(t\) by simply simulating the process model: \[X_{n,j}^{P} \sim \mathrm{process}(X_{n-1,j}^{F},\theta), \qquad j=1,\dots,J.\]

  3. Having obtained \(x_{n,j}^{P}\), we obtain a sample of points from the filtering distribution at time \(t_n\) by resampling from \(\big\{X_{n,j}^{P},j\in 1:J\big\}\) with weights \[w_{n,j}=f_{Y_n|X_n}{y^*_{n}|X^P_{n,j}{\, ; \,}\theta}.\]

  4. The Monte Carlo principle tells us that the conditional likelihood \[\begin{eqnarray} {\mathscr{L}}_n(\theta) &=& f_{Y_n|Y_{1:n-1}}{y^*_n|y^*_{1:n-1}{\, ; \,}\theta} \\ &=& \int f_{Y_n|X_n}(y^*_{n}|x_{n}{\, ; \,}\theta)\,f_{X_n|Y_{1:n-1}}(x_{n}|y^*_{1:n-1}{\, ; \,}\theta)\, dx_n. \end{eqnarray} \] is approximated by \[\hat{\mathscr{L}}_n(\theta) \approx \frac{1}{N}\,\sum_j\, f_{Y_n|X_n}(y^*_{n}|X_{n,j}^{P}{\, ; \,}\theta).\]

  5. We can iterate this procedure through the data, one step at a time, alternately simulating and resampling, until we reach \(n=N\).

  6. The full log likelihood then has approximation \[\begin{eqnarray}{\ell}(\theta) &=& \log{{\mathscr{L}}(\theta)} \\ &=& \sum_n \log{{\mathscr{L}}_n(\theta)} \\ &\approx& \sum_n\log\hat{\mathscr{L}}_n(\theta). \end{eqnarray} \]

13.4 Sequential Monte Carlo in pomp

sims <- simulate(sir,params=c(Beta=2,gamma=1,rho=0.8,N=2600),nsim=20,
                 as.data.frame=TRUE,include.data=TRUE)

ggplot(sims,mapping=aes(x=time,y=B,group=sim,color=sim=="data"))+
  geom_line()+guides(color=FALSE)

pf <- pfilter(sir,Np=5000,params=c(Beta=2,gamma=1,rho=0.8,N=2600))
logLik(pf)
## [1] -87.82817
pf <- replicate(10,pfilter(sir,Np=5000,params=c(Beta=2,gamma=1,rho=0.8,N=2600)))
ll <- sapply(pf,logLik); ll
##  [1] -84.65377 -82.83597 -78.01854 -73.30744 -81.52721 -86.99616 -80.03618
##  [8] -75.22000 -81.15559 -84.17097
logmeanexp(ll,se=TRUE)
##                    se 
## -75.462762   1.778989

13.5 The graph of the likelihood function: The likelihood surface

sliceDesign(
  c(Beta=2,gamma=1,rho=0.8,N=2600),
  Beta=rep(seq(from=0.5,to=4,length=40),each=3),
  gamma=rep(seq(from=0.5,to=2,length=40),each=3)) -> p

require(foreach)
require(doMC)
registerDoMC(cores=5)        ## number of cores on this machine

set.seed(998468235L,kind="L'Ecuyer")
mcopts <- list(preschedule=FALSE,set.seed=TRUE)

foreach (theta=iter(p,"row"),.combine=rbind,
         .inorder=FALSE,.options.multicore=mcopts) %dopar% 
 {
   pfilter(sir,params=unlist(theta),Np=5000) -> pf
   theta$loglik <- logLik(pf)
   theta
 } -> p




13.5.1 Exercise: Write down the definition of a likelihood slice in mathematical notation.

13.5.2 Exercise: Note that a likelihood slice is different from a likelihood profile

  • What is the difference between a likelihood slice and a profile from a computational perspective?

  • What is the difference between the statistical interpretation of a likelihood slice and a profile. For example, compare their roles in constructing confidence intervals and hypothesis tests.

  • What is the purpose of computing a likelihood slice? What is the purpose of computing a likelihood profile?




  • Note that the slice computation used the foreach package with the multicore backend (doMC) to parallelize these computations.

  • To ensure that we have high-quality random numbers in each parallel R session, we use a parallel random number generator. Notice the specification kind="L'Ecuyer" when we set the seed for the random number generator, and the specification .options.multicore=mcopts for foreach, the parallel for loop

  • Now, we can plot the likelihood slices:

foreach (v=c("Beta","gamma")) %do% 
{
  x <- subset(p,slice==v)
  plot(x[[v]],x$loglik,xlab=v,ylab="loglik")
}


13.5.3 Exercise: Likelihood slices

Add likelihood slices along the \(\rho\) direction.


Slices offer a very limited perspective on the geometry of the likelihood surface. With just two parameters, we can evaluate the likelihood at a grid of points and visualize the surface directly.

expand.grid(Beta=seq(from=1,to=4,length=50),
            gamma=seq(from=0.7,to=3,length=50),
            rho=0.8,
            N=2600) -> p

foreach (theta=iter(p,"row"),.combine=rbind,
         .inorder=FALSE,.options.multicore=mcopts) %dopar% 
 {
   pfilter(sir,params=unlist(theta),Np=5000) -> pf
   theta$loglik <- logLik(pf)
   theta
 } -> p
pp <- mutate(p,loglik=ifelse(loglik>max(loglik)-100,loglik,NA))
ggplot(data=pp,mapping=aes(x=Beta,y=gamma,z=loglik,fill=loglik))+
  geom_tile(color=NA)+
  geom_contour(color='black',binwidth=3)+
  scale_fill_gradient()+
  labs(x=expression(beta),y=expression(gamma))


13.5.4 Exercise: 2D likelihood slice

Compute a slice of the likelihood in the \(\beta\)-\(N\) plane.


13.6 Maximizing the likelihood using the particle filter

13.6.1 Causal, mechanistic interpretation of parameter estimates

  • When we write down a mechanistic model for a system, we have some idea of what we intend parameters to mean. In epidemiology, for example, we interpret parameters as a reporting rate, a contact rate between individuals, an immigration rate, a duration of immunity, etc.

  • The data and the parameter estimation procedure do not know about our intended interpretation of the model. It can and does happen that some parameter estimates statistically consistent with the data may be scientifically absurd according to the scientific reasoning that went into building the model.

  • This can arise as a consequence of weak identifiability.

  • It can also be a warning that the data do not agree that our model represents reality in the way we had hoped. Perhaps more work is needed on model development.

  • Scientifically unreasonable parameter estimates can sometimes be avoided by fixing some parameters at known, reasonable values. However, this risks suppressing the warning that the data were trying to give about weaknesses in the model, or in the biological interpretation of it.

  • This issue will be discussed further when it arises in case studies.




13.6.2 Likelihood maximization for the boarding school flu example

  • We saw above that the default parameter set for the ‘bsflu’ pomp object is not particularly close to the MLE.

  • One way to find the MLE is to try optimizing the estimated likelihood, computed by the particle filter, directly.

  • There are many optimization algorithms to choose from, and many implemented in R.

  • Three issues arise immediately.

  1. The particle filter gives us a stochastic estimate of the likelihood. We can reduce this variability by making the number of particles, Np, larger. However, we cannot make it go away. If we use a deterministic optimizer (i.e., one that assumes the objective function is evaluated deterministically), then we must control this variability somehow. For example, we can fix the seed of the pseudo-random number generator. A side effect will be that the objective function becomes jagged, marked by many small local knolls and pits. Alternatively, we can use a stochastic optimization algorithm, with which we will be only be able to obtain estimates of our MLE. This is the trade-off between a noisy and a rough objective function.

  2. Because the particle filter gives us just an estimate of the likelihood and no information about the derivative, we must choose an algorithm that is “derivative-free”. There are many such, but we can expect less efficiency than would be possible with derivative information. Note that finite differencing is not an especially promising way of constructing derivatives. The price would be a \(n\)-fold increase in cpu time, where \(n\) is the dimension of the parameter space. Also, since the likelihood is only estimated, we would expect the derivative estimates to be noisy.

  3. Finally, the parameters are constrained to be positive, and \(\rho < 1\). We must therefore select an optimizer that can solve this constrained maximization problem, or figure out some of way of turning it into an unconstrained maximization problem. For the latter purpose, one can transform the parameters onto a scale on which there are no constraints.

First, let’s try deterministic optimization of a rough function. We’ll try using optim’s default method: Nelder-Mead, fixing the random-number generator seed to make the likelihood calculation deterministic.




13.7 An iterated filtering algorithm (IF2)

13.7.1 IF2 algorithm pseudocode

model input: Simulators for \(f_{X_0}(x_0;\theta)\) and \(f_{X_n|X_{n-1}}(x_n| x_{n-1}; \theta)\); evaluator for \(f_{Y_n|X_n}(y_n| x_n;\theta)\); data, \(y^*_{1:N}\)

algorithmic parameters: Number of iterations, \(M\); number of particles, \(J\); initial parameter swarm, \(\{\Theta^0_j, j=1,\dots,J\}\); perturbation density, \(h_n(\theta|\varphi;\sigma)\); perturbation scale, \(\sigma_{1{:}M}\)

output: Final parameter swarm, \(\{\Theta^M_j, j=1,\dots,J\}\)

  1. \(\quad\) For \(m\) in \(1{:} M\)
  2. \(\quad\quad\quad\) \(\Theta^{F,m}_{0,j}\sim h_0(\theta|\Theta^{m-1}_{j}; \sigma_m)\) for \(j\) in \(1{:} J\)
  3. \(\quad\quad\quad\) \(X_{0,j}^{F,m}\sim f_{X_0}(x_0 ; \Theta^{F,m}_{0,j})\) for \(j\) in \(1{:} J\)
  4. \(\quad\quad\quad\) For \(n\) in \(1{:} N\)
  5. \(\quad\quad\quad\quad\quad\) \(\Theta^{P,m}_{n,j}\sim h_n(\theta|\Theta^{F,m}_{n-1,j},\sigma_m)\) for \(j\) in \(1{:} J\)
  6. \(\quad\quad\quad\quad\quad\) \(X_{n,j}^{P,m}\sim f_{X_n|X_{n-1}}(x_n | X^{F,m}_{n-1,j}; \Theta^{P,m}_j)\) for \(j\) in \(1{:} J\)
  7. \(\quad\quad\quad\quad\quad\) \(w_{n,j}^m = f_{Y_n|X_n}(y^*_n| X_{n,j}^{P,m} ; \Theta^{P,m}_{n,j})\) for \(j\) in \(1{:} J\)
  8. \(\quad\quad\quad\quad\quad\) Draw \(k_{1{:}J}\) with \(P[k_j=i]= w_{n,i}^m\Big/\sum_{u=1}^J w_{n,u}^m\)
  9. \(\quad\quad\quad\quad\quad\) \(\Theta^{F,m}_{n,j}=\Theta^{P,m}_{n,k_j}\) and \(X^{F,m}_{n,j}=X^{P,m}_{n,k_j}\) for \(j\) in \(1{:} J\)
  10. \(\quad\quad\quad\) End For
  11. \(\quad\quad\quad\) Set \(\Theta^{m}_{j}=\Theta^{F,m}_{N,j}\) for \(j\) in \(1{:} J\)
  12. \(\quad\) End For

comments:

  • The \(N\) loop (lines 4 through 10) is a basic particle filter applied to a model with stochastic perturbations to the parameters.

  • The \(M\) loop repeats this particle filter with decreasing perturbations.

  • The superscript \(F\) in \(\Theta^{F,m}_{n,j}\) and \(X^{F,m}_{n,j}\) denote solutions to the filtering problem, with the particles \(j=1,\dots,J\) providing a Monte Carlo representation of the conditional distribution at time \(n\) given data \(y^*_{1:n}\) for filtering iteration \(m\).

  • The superscript \(P\) in \(\Theta^{P,m}_{n,j}\) and \(X^{P,m}_{n,j}\) denote solutions to the prediction problem, with the particles \(j=1,\dots,J\) providing a Monte Carlo representation of the conditional distribution at time \(n\) given data \(y^*_{1:n-1}\) for filtering iteration \(m\).

  • The weight \(w^m_{n,j}\) gives the likelihood of the data at time \(n\) for particle \(j\) in filtering iteration \(m\).

13.7.2 Choosing the algorithmic settings for IF2

  • The initial parameter swarm, \(\{ \Theta^0_j, j=1,\dots,J\}\), usually consists of \(J\) identical replications of some starting parameter vector.

  • \(J\) is set to be sufficient for particle filtering. By the time of the last iteration (\(m=M\)) one should not have effective sample size close to 1.

  • Perturbations are usually chosen to be Gaussian, with \(\sigma_m\) being a scale factor for iteration \(m\): \[h_n(\theta|\varphi;\sigma) \sim N[\varphi, \sigma^2_m V_n].\]
  • \(V_n\) is usually taken to be diagonal, \[ V_n = \left( \begin{array}{ccccc} v_{1,n}^2 & 0 & 0 & \rightarrow & 0 \\ 0 & v_{2,n}^2 & 0 & \rightarrow & 0 \\ 0 & 0 & v_{3,n}^2 & \rightarrow & 0 \\ \downarrow & & & \searrow & \downarrow \\ 0 & 0 & 0 & \rightarrow & v_{p,n}^2 \end{array}\right).\]
  • If \(\theta_i\) is a parameter that affects the dynamics or observations throughout the timeseries, it is called a regular parameter, and it is often appropriate to specify \[ v_{i,n} = v_i,\]
  • If \(\theta_j\) is a parameter that affects only the initial conditions of the dynamic model, it is called an initial value parameter (IVP) and it is appropriate to specify \[ v_{j,n} = \left\{\begin{array}{ll} v_j & \mbox{if $n=0$} \\ 0 & \mbox{if $n>0$} \end{array}\right.\]
  • If \(\theta_k\) is a break-point parameter that models how the system changes at time \(t_q\) then \(\theta_k\) is like an IVP at time \(t_q\) and it is appropriate to specify \[ v_{j,n} = \left\{\begin{array}{ll} v_j & \mbox{if $n=q$} \\ 0 & \mbox{if $n\neq q$} \end{array}\right.\]

  • \(\sigma_{1:M}\) is called a cooling schedule, following a thermodynamic analogy popularized by simulated annealing. As \(\sigma_m\) becomes small, the system cools toward a “freezing point”. If the algorithm is working sucessfully, the freezing point should be close to the lowest-energy state of the system, i.e., the MLE.

  • It is generally helpful for optimization to provide transformations of the parameters so that (on the estimation scale) they are real-valued and have uncertainty on the order of 1 unit. For example, one typically takes a logarithmic transformation of positive parameters and a logistic transformation of \([0,1]\) valued parameters.
  • On this scale, it is surprisingly often effective to take \[ v_i = 0.02\] for regular parameters (RPs) and \[ v_j = 0.1\] for initial value parameters (IVPs).

  • We suppose that \(\sigma_1=1\), since the scale of the parameters is addressed by the matrix \(V_n\) . Early on in an investigation, one might take \(M=100\) and \(\sigma_M=0.1\). Later on, consideration of diagnostic plots may suggest refinements.

  • It is surprising that useful general advice exists for these quantities that could in principle be highly model-specific. Here is one possible explanation: the precision of interest is often the second significant figure and there are often order 100 observations (10 monthly obsevations would be too few to fit a mechanistic model; 1000 would be unusual for an epidemiological system).




13.8 Applying IF2 to a boarding school influenza outbreak

bsflu_data <- read.table("bsflu_data.txt")
bsflu_statenames <- c("S","I","R1","R2")
bsflu_paramnames <- c("Beta","mu_I","rho","mu_R1","mu_R2")

The observation names (\(B\), \(C\)) are the names of the data variables:

(bsflu_obsnames <- colnames(bsflu_data)[1:2])
## [1] "B" "C"

We do not need a representation of \(R_3\) since the total population size is fixed at \(P=763\) and hence \(R_3(t)=P-S(t)-I(t)-R_1(t)-R_2(t)\). Now, we write the model code:

bsflu_dmeasure <- "
  lik = dpois(B,rho*R1+1e-6,give_log);
"

bsflu_rmeasure <- "
  B = rpois(rho*R1+1e-6);
  C = rpois(rho*R2);
"

bsflu_rprocess <- "
  double t1 = rbinom(S,1-exp(-Beta*I*dt));
  double t2 = rbinom(I,1-exp(-dt*mu_I));
  double t3 = rbinom(R1,1-exp(-dt*mu_R1));
  double t4 = rbinom(R2,1-exp(-dt*mu_R2));
  S -= t1;
  I += t1 - t2;
  R1 += t2 - t3;
  R2 += t3 - t4;
"

bsflu_fromEstimationScale <- "
 TBeta = exp(Beta);
 Tmu_I = exp(mu_I);
 Trho = expit(rho);
"

bsflu_toEstimationScale <- "
 TBeta = log(Beta);
 Tmu_I = log(mu_I);
 Trho = logit(rho);
"

bsflu_initializer <- "
 S=762;
 I=1;
 R1=0;
 R2=0;
"

We can now build the new bsflu pomp object.

require(pomp)
stopifnot(packageVersion("pomp")>="0.75-1")
bsflu2 <- pomp(
  data=bsflu_data,
  times="day",
  t0=0,
  rprocess=euler.sim(
    step.fun=Csnippet(bsflu_rprocess),
    delta.t=1/12
  ),
  rmeasure=Csnippet(bsflu_rmeasure),
  dmeasure=Csnippet(bsflu_dmeasure),
  fromEstimationScale=Csnippet(bsflu_fromEstimationScale),
  toEstimationScale=Csnippet(bsflu_toEstimationScale),
  obsnames = bsflu_obsnames,
  statenames=bsflu_statenames,
  paramnames=bsflu_paramnames,
  initializer=Csnippet(bsflu_initializer)
)
plot(bsflu2)

run_level <- 3
switch(run_level,
       {bsflu_Np=100; bsflu_Nmif=10; bsflu_Neval=10; bsflu_Nglobal=10; bsflu_Nlocal=10}, 
       {bsflu_Np=20000; bsflu_Nmif=100; bsflu_Neval=10; bsflu_Nglobal=10; bsflu_Nlocal=10}, 
       {bsflu_Np=60000; bsflu_Nmif=300; bsflu_Neval=10; bsflu_Nglobal=100; bsflu_Nlocal=20}
)

13.8.1 Running a particle filter

Before engaging in iterated filtering, it is often a good idea to check that the basic particle filter is working since iterated filtering builds on this technique. Here, carrying out slightly circular reasoning, we are going to test pfilter on a previously computed point estimate read in from bsflu_params.csv:

bsflu_params <- data.matrix(read.table("mif_bsflu_params.csv",row.names=NULL,header=TRUE))
bsflu_mle <- bsflu_params[which.max(bsflu_params[,"logLik"]),][bsflu_paramnames]

We are going to treat \(\mu_{R_1}\) and \(\mu_{R_2}\) as known, fixed at the empirical mean of the bed-confinement and convalescence times for symptomatic cases:

bsflu_fixed_params <- c(mu_R1=1/(sum(bsflu_data$B)/512),mu_R2=1/(sum(bsflu_data$C)/512))
  • It is convenient to do some parallelization to speed up the computations. Most machines are multi-core nowadays, and using this computational capacity involves only

    1. the following lines of code to let R know you plan to use multiple processors;

    2. using the parallel for loop provided by ‘foreach’.

require(doParallel)
cores <- 20
registerDoParallel(cores)
mcopts <- list(set.seed=TRUE)

set.seed(396658101,kind="L'Ecuyer")
  • We proceed to carry out replicated particle filters at this tentative MLE:
stew(file=sprintf("pf-%d.rda",run_level),{
  t_pf <- system.time(
    pf <- foreach(i=1:20,.packages='pomp',
                  .options.multicore=mcopts) %dopar% try(
                    pfilter(bsflu2,params=bsflu_mle,Np=bsflu_Np)
                  )
  )
  
},seed=1320290398,kind="L'Ecuyer")

(L_pf <- logmeanexp(sapply(pf,logLik),se=TRUE))
##                      se 
## -73.3241678   0.3197182
  • In 9.4 seconds, we obtain an unbiased likelihood estimate of -73.32 with a Monte standard error of 0.32.

13.8.2 A local search of the likelihood surface

  • Let’s carry out a local search using mif2 around this previously identified MLE. For that, we need to set the rw.sd and cooling.fraction.50 algorithmic parameters:
bsflu_rw.sd <- 0.02
bsflu_cooling.fraction.50 <- 0.5

stew(file=sprintf("local_search-%d.rda",run_level),{
  
  t_local <- system.time({
    mifs_local <- foreach(i=1:bsflu_Nlocal,.packages='pomp', .combine=c, .options.multicore=mcopts) %dopar%  {
      mif2(
        bsflu2,
        start=bsflu_mle,
        Np=bsflu_Np,
        Nmif=bsflu_Nmif,
        cooling.type="geometric",
        cooling.fraction.50=bsflu_cooling.fraction.50,
        transform=TRUE,
        rw.sd=rw.sd(
          Beta=bsflu_rw.sd,
          mu_I=bsflu_rw.sd,
          rho=bsflu_rw.sd
        )
      )
      
    }
  })
  
},seed=900242057,kind="L'Ecuyer")
  • Although the filtering carried out by mif2 in the final filtering iteration generates an approximation to the likelihood at the resulting point estimate, this is not usually good enough for reliable inference. Partly, this is because some parameter perturbations remain in the last filtering iteration. Partly, this is because mif2 is usually carried out with a smaller number of particles than is necessary for a good likelihood evaluation—the errors in mif2 average out over many iterations of the filtering.

  • Therefore, we evaluate the likelihood, together with a standard error, using replicated particle filters at each point estimate:

stew(file=sprintf("lik_local-%d.rda",run_level),{
    t_local_eval <- system.time({
    liks_local <- foreach(i=1:bsflu_Nlocal,.packages='pomp',.combine=rbind) %dopar% {
      evals <- replicate(bsflu_Neval, logLik(pfilter(bsflu2,params=coef(mifs_local[[i]]),Np=bsflu_Np)))
      logmeanexp(evals, se=TRUE)
    }
  })
},seed=900242057,kind="L'Ecuyer")

results_local <- data.frame(logLik=liks_local[,1],logLik_se=liks_local[,2],t(sapply(mifs_local,coef)))
summary(results_local$logLik,digits=5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -74.543 -74.243 -74.038 -74.010 -73.930 -73.301
  • This investigation took 48.1 minutes for the maximization and 1.5 minutes for the likelihood evaluation. These repeated stochastic maximizations can also show us the geometry of the likelihood surface in a neighborhood of this point estimate:
pairs(~logLik+Beta+mu_I+rho,data=subset(results_local,logLik>max(logLik)-50))




13.9 A global search of the likelihood surface using randomized starting values

bsflu_box <- rbind(
  Beta=c(0.001,0.01),
  mu_I=c(0.5,2),
  rho = c(0.5,1)
)
stew(file=sprintf("box_eval-%d.rda",run_level),{
  
  t_global <- system.time({
    mifs_global <- foreach(i=1:bsflu_Nglobal,.packages='pomp', .combine=c, .options.multicore=mcopts) %dopar%  mif2(
      mifs_local[[1]],
      start=c(apply(bsflu_box,1,function(x)runif(1,x[1],x[2])),bsflu_fixed_params)
    )
  })
},seed=1270401374,kind="L'Ecuyer")
stew(file=sprintf("lik_global_eval-%d.rda",run_level),{
  t_global_eval <- system.time({
    liks_global <- foreach(i=1:bsflu_Nglobal,.packages='pomp',.combine=rbind, .options.multicore=mcopts) %dopar% {
      evals <- replicate(bsflu_Neval, logLik(pfilter(bsflu2,params=coef(mifs_global[[i]]),Np=bsflu_Np)))
      logmeanexp(evals, se=TRUE)
    }
  })
},seed=442141592,kind="L'Ecuyer")

results_global <- data.frame(logLik=liks_global[,1],logLik_se=liks_global[,2],t(sapply(mifs_global,coef)))
summary(results_global$logLik,digits=5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -75.537 -74.527 -74.299 -74.268 -74.030 -71.913
if (run_level>2) 
  write.table(rbind(results_local,results_global),
              file="mif_bsflu_params.csv",append=TRUE,col.names=FALSE,row.names=FALSE)
pairs(~logLik+Beta+mu_I+rho,data=subset(results_global,logLik>max(logLik)-250))

13.10 Exercises

13.10.1 Exercise: Construct a profile likelihood

  • How strong is the evidence about the contact rate, \(\beta\), given this model and data? Use mif2 to construct a profile likelihood. Due to time constraints, you may be able to compute only a preliminary version.

  • It is also possible to profile over the basic reproduction number, \(R_0=\beta P/\mu_I\). Is this more or less well determined that \(\beta\) for this model and data?

13.10.2 Exercise: Check the model source code

  • Check the source code for the bsflu pomp object. Does the code implement the model described?

  • For various reasons, it can be surprisingly hard to make sure that the written equations and the code are perfectly matched. Here are some things to think about:

  1. Papers should be written to be readable to as broad a community as possible. Code must be written to run successfully. People do not want to clutter papers with numerical details which they hope and belief are scientifically irrelevant. What problems can arise due to this, and what solutions are available?

  2. Suppose that there is an error in the coding of rprocess. Suppose that plug-and-play statistical methodology is used to infer parameters. A conscientious researcher carries out a simulation study, using simulate to generate some realizations from the fitted model and checking that the inference methodology can successfully recover the known parameters for this model, up to some statistical error. Will this procedure help to identify the error in rprocess? If not, how does one debug rprocess? What research practices help minimize the risk of errors in simulation code?

13.10.3 Exercise: Assessing and improving algorithmic parameters

Develop your own heuristics to try to improve the performance of mif2 in the previous example. Specifically, for a global optimization procedure carried out using random starting values in the specified box, let \(\hat\Theta_\mathrm{max}\) be a random Monte Carlo estimate of the resulting MLE, and let \(\hat\theta\) be the true (unknown) MLE. We can define the maximization error in the log likelihood to be \[e = \ell(\hat\theta) - E[\ell(\hat\Theta_\mathrm{max})].\] We cannot directly evaluate \(e\), since there is also Monte Carlo error in our evaluation of \(\ell(\theta)\), but we can compute it up to a known precision. Plan some code to estimates \(e\) for a search procedure using a computational effort of \(JM=2\times 10^7\), comparable to that used for each mif computation in the global search. Discuss the strengths and weaknesses of this quantification of optimization success. See if you can choose \(J\) and \(M\) subject to this constraint, together with choices of rw.sd and the cooling rate, cooling.fraction.50, to arrive at a quantifiably better procedure. Computationally, you may not be readily able to run your full procedure, but you could run a quicker version of it.

13.10.4 Exercise: Finding sharp peaks in the likelihood surface

  • Even in this small, 3 parameter, example, it takes a considerable amount of computation to find the global maximum (with values of \(\beta\) around 0.004) starting from uniform draws in the specified box. The problem is that, on the scale on which “uniform” is defined, the peak around \(\beta\approx 0.004\) is very narrow. Propose and test a more favorable way to draw starting parameters for the global search, with better scale invariance properties.

13.10.5 Exercise: Adding a latent class

  • Modify the model to include a latent period between becoming exposed and becoming infectious. See what effect this has on the maximized likelihood.

References

Anonymous. 1978. Influenza in a boarding school. British Medical Journal 1:587.

Arulampalam, M. S., S. Maskell, N. Gordon, and T. Clapp. 2002. A tutorial on particle filters for online nonlinear, non-Gaussian Bayesian tracking. IEEE Trans. Sig. Proc. 50:174–188.

Doucet, A., N. de Freitas, and N. Gordon, editors. 2001. Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York.

Ionides, E. L., D. Nguyen, Y. Atchadé, S. Stoev, and A. A. King. 2015. Inference for dynamic and latent variable models via iterated, perturbed Bayes maps. Proceedings of the National Academy of Sciences of USA 112:719–– 724.

Kitagawa, G. 1987. Non-Gaussian state-space modeling of nonstationary time series. Journal of the American Statistical Association 82:1032–1041.